-
Notifications
You must be signed in to change notification settings - Fork 452
Optimize simple time ranged search queries #5759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimize simple time ranged search queries #5759
Conversation
804be29
to
61c3b74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR. The way the logic is split here between optimize_split_order
and optimize
makes it very hard to proof read. Currently it's already a bit confusing because the optimize()
logic for each variant depends on the sort order which is different for each variant. It would be good if we could avoid tightening the coupling between the two match
statements even further by making the sort orders more complex.
quickwit/quickwit-search/src/leaf.rs
Outdated
let min_required_splits = splits | ||
.iter() | ||
// splits are sorted by whether they are contained in the request time range | ||
.filter(|split| Self::is_contained(split, &request)) | ||
.map(|split| split.num_docs) | ||
// computing the partial sum | ||
.scan(0u64, |partial_sum: &mut u64, num_docs_in_split: u64| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this works? Say you have 5 splits:
- 1 is contained in the request
- 4 are only overlapping the request
In that case min_required_splits
would be 1 here, but there might be a biggest_end_timestamp
or smallest_start_timestamp
in any of the other splits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts were that there's a
.take_while(|partial_sum| *partial_sum < num_requested_docs)
so min_required_splits
would not get to be 1, but now I see that I should validate that we actually reached this condition, and only if we reached it to set min_required_splits
and optimize, otherwise return early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a quick fix, I need to test it, but I think it answers the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know of std::ops::ControlFlow
😄. Nevertheless, this reaches the limit when iterators stop being practical. A for
loop is much more readable here.
61c3b74
to
851c15c
Compare
It is a bit confusing, I'll try to think of a better way to do this when I have some more time. |
103b8d9
to
69e85ff
Compare
I would try refactoring this into:
where
This would regroup the optimizations logics with the sorts they depend on to be correct. (To help the review, ideally, re-implement the current logic in 1 commit, and in a separate commit add you logic extension.) Thanks! |
ed6cd7b
to
3055965
Compare
cb52ca6
to
6b427b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is a bit more verbose, but I find it easier to read!
Check the comment marked "MORE IMPORTANT" first 😉
quickwit/quickwit-search/src/leaf.rs
Outdated
if partial_sum >= num_requested_docs { | ||
return Some(min_required_splits + 1); | ||
} | ||
CanSplitDoBetter::SplitTimestampLower(_) => { | ||
splits.sort_unstable_by_key(|split| split.timestamp_start()) | ||
|
||
min_required_splits += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, this is more readable
min_required_splits += 1;
if partial_sum >= num_requested_docs {
return Some(min_required_splits);
}
fn get_min_required_splits( | ||
splits: &[SplitIdAndFooterOffsets], | ||
request: &SearchRequest, | ||
) -> Option<usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this is now really straightforward to understand and verify 😄
quickwit/quickwit-search/src/leaf.rs
Outdated
let contained = Self::is_split_contained_in_search_time_range(split, &request); | ||
(!contained, std::cmp::Reverse(split.timestamp_end())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is really hard to know here if contained splits come first or second (the ordering of boolean is not obvious, the !
negation doesn't help 😄 ). Could we maybe explicitly map true/false to something that is more clearly ordered:
let contained_first = match Self::is_split_contained_in_search_time_range(split, &request) {
true => 0u8,
false => 1u8,
};
(contained_first, std::cmp::Reverse(split.timestamp_end()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MORE IMPORTANT: now that I think about it, I thing it's not good idea to have contained splits processed first in the general case (!is_simple_all_query
or get_min_required_splits().is_none()`). You miss an optimization opportunity, because once you'll have processed them, you will still need to check the splits overlapping the higher end of the range for better matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, we can maybe instead of sorting by contained, keep the same sort mechanism as it was.
but still keep the new implementation of get_min_required_splits
, meaning we won't optimize the first few splits which might not be contained in the search query's range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the change to not sort, tell me what you think.
splits | ||
.into_iter() | ||
.map(|split| (split, (*request).clone())) | ||
.collect::<Vec<_>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the comment:
// TODO: we maybe want here some deduplication + Cow logic
quickwit/quickwit-search/src/leaf.rs
Outdated
fn optimize_split_id_higher( | ||
&self, | ||
request: Arc<SearchRequest>, | ||
mut splits: Vec<SplitIdAndFooterOffsets>, | ||
) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could actually call to_splits_with_request
before (inside optimize
) and simplify this signature to something like:
fn optimize_split_id_higher( | |
&self, | |
request: Arc<SearchRequest>, | |
mut splits: Vec<SplitIdAndFooterOffsets>, | |
) -> Result<Vec<(SplitIdAndFooterOffsets, SearchRequest)>, SearchError> { | |
fn optimize_split_id_higher( | |
&mut [(SplitIdAndFooterOffsets, SearchRequest)] | |
) -> Result<(), SearchError> { |
This would prevent quite a lot of repetition and is also more explicit because it shows that we don't drop splits (note that &self
is also dropped because not used)
(applies to all optimzie_split_xxx
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified to remove &self
, tried what you suggested, but then simple stuff like calling is_simple_all_query
that accepts a &SearchRequest
became ugly, WDYT?
4758505
to
b7ea08e
Compare
b7ea08e
to
f066b6c
Compare
When the search request contains a time range, we aborted the optimization of converting unneeded split searches into count queries.
f066b6c
to
110b53a
Compare
When the search request contains a time range, we aborted the optimization of converting unneeded split searches into count queries.
Removed the
TODO
that was in the code for this.Split the PR into 2: #5758.